声乐爆发在交流情感中起着重要的作用,使它们对于改善语音情感识别很有价值。在这里,我们介绍了我们在ACII情感声乐爆发工作室和挑战2022(A-VB)中预测声音爆发并预测其情感意义的方法。我们使用大型的自我监督音频模型作为共享的功能提取器,并比较在分类器链和注意力网络上构建的多个体系结构,并结合不确定性减少减肥策略。我们的方法超过了所有四个任务的挑战基线。
translated by 谷歌翻译
我们提出了一种新型的动态约束不确定性加权损失,以实验处理平衡多个任务对ICML EXVO 2022挑战的贡献的问题。多任务旨在共同认识到声乐爆发中表达的情绪和人口特征。我们的策略结合了不确定性重量和平均动态重量的优势,通过用约束术语扩展权重以使学习过程更具解释。我们使用轻巧的多EXIT CNN体系结构来实施我们提出的损失方法。实验性H-均值得分(0.394)显示出比基线H均值得分的显着改善(0.335)。
translated by 谷歌翻译
爆发两年多后,Covid-19的大流行继续困扰世界各地的医疗系统,给稀缺资源带来压力,并夺走了人类的生命。从一开始,已经采用了各种基于AI的CoVID-19检测和监测工具,以试图通过及时诊断来阻止感染的潮流。特别是,已经建议计算机试听是一种非侵入性,成本效益和环保的替代方法,可通过声音通过声音来检测COVID-19的感染。但是,像所有AI方法一样,计算机试镜也很大程度上取决于可用数据的数量和质量,并且由于此类数据的敏感性,大规模的COVID-19声音数据集很难获取 - 除其他原因外。为此,我们介绍了COVYT数据集 - 一种新颖的Covid-19数据集,该数据集是从包含来自65位演讲者的8个小时以上语音的公共资源中收集的。与其他现有的COVID-19声音数据集相比,COVYT数据集的独特功能是,它包括所有65位扬声器的covid-19正和负样本。我们使用可解释的音频描述来分析Covid-19的声学表现,并使用可解释的音频描述,并研究几种分类场景,并调查一些分类场景,以将基于公平的言语的COVID进行适当的分配策略-19检测。
translated by 谷歌翻译
在这项工作中,我们探索了一种小说的几弹性个性化体系结构,以进行情感发声预测。核心贡献是一个“注册”编码器,它利用目标扬声器的两个未标记的样本来调整情感编码器的输出。调整基于点产生的注意力,因此有效地充当“软”特征选择的一种形式。情感和注册编码器基于两个标准音频体系结构:CNN14和CNN10。这两个编码器进一步指导忘记或学习辅助情感和/或说话者信息。我们最好的方法在EXVO少量开发套件上达到了CCC $ .650 $,比我们的基线CNN14 CCC $ 2.5 \%$增加了$ .634 $。
translated by 谷歌翻译
The development of social media user stance detection and bot detection methods rely heavily on large-scale and high-quality benchmarks. However, in addition to low annotation quality, existing benchmarks generally have incomplete user relationships, suppressing graph-based account detection research. To address these issues, we propose a Multi-Relational Graph-Based Twitter Account Detection Benchmark (MGTAB), the first standardized graph-based benchmark for account detection. To our knowledge, MGTAB was built based on the largest original data in the field, with over 1.55 million users and 130 million tweets. MGTAB contains 10,199 expert-annotated users and 7 types of relationships, ensuring high-quality annotation and diversified relations. In MGTAB, we extracted the 20 user property features with the greatest information gain and user tweet features as the user features. In addition, we performed a thorough evaluation of MGTAB and other public datasets. Our experiments found that graph-based approaches are generally more effective than feature-based approaches and perform better when introducing multiple relations. By analyzing experiment results, we identify effective approaches for account detection and provide potential future research directions in this field. Our benchmark and standardized evaluation procedures are freely available at: https://github.com/GraphDetec/MGTAB.
translated by 谷歌翻译
Interview has been regarded as one of the most crucial step for recruitment. To fully prepare for the interview with the recruiters, job seekers usually practice with mock interviews between each other. However, such a mock interview with peers is generally far away from the real interview experience: the mock interviewers are not guaranteed to be professional and are not likely to behave like a real interviewer. Due to the rapid growth of online recruitment in recent years, recruiters tend to have online interviews, which makes it possible to collect real interview data from real interviewers. In this paper, we propose a novel application named EZInterviewer, which aims to learn from the online interview data and provides mock interview services to the job seekers. The task is challenging in two ways: (1) the interview data are now available but still of low-resource; (2) to generate meaningful and relevant interview dialogs requires thorough understanding of both resumes and job descriptions. To address the low-resource challenge, EZInterviewer is trained on a very small set of interview dialogs. The key idea is to reduce the number of parameters that rely on interview dialogs by disentangling the knowledge selector and dialog generator so that most parameters can be trained with ungrounded dialogs as well as the resume data that are not low-resource. Evaluation results on a real-world job interview dialog dataset indicate that we achieve promising results to generate mock interviews. With the help of EZInterviewer, we hope to make mock interview practice become easier for job seekers.
translated by 谷歌翻译
Dynamic treatment regimes assign personalized treatments to patients sequentially over time based on their baseline information and time-varying covariates. In mobile health applications, these covariates are typically collected at different frequencies over a long time horizon. In this paper, we propose a deep spectral Q-learning algorithm, which integrates principal component analysis (PCA) with deep Q-learning to handle the mixed frequency data. In theory, we prove that the mean return under the estimated optimal policy converges to that under the optimal one and establish its rate of convergence. The usefulness of our proposal is further illustrated via simulations and an application to a diabetes dataset.
translated by 谷歌翻译
Temporal sentence grounding (TSG) aims to identify the temporal boundary of a specific segment from an untrimmed video by a sentence query. All existing works first utilize a sparse sampling strategy to extract a fixed number of video frames and then conduct multi-modal interactions with query sentence for reasoning. However, we argue that these methods have overlooked two indispensable issues: 1) Boundary-bias: The annotated target segment generally refers to two specific frames as corresponding start and end timestamps. The video downsampling process may lose these two frames and take the adjacent irrelevant frames as new boundaries. 2) Reasoning-bias: Such incorrect new boundary frames also lead to the reasoning bias during frame-query interaction, reducing the generalization ability of model. To alleviate above limitations, in this paper, we propose a novel Siamese Sampling and Reasoning Network (SSRN) for TSG, which introduces a siamese sampling mechanism to generate additional contextual frames to enrich and refine the new boundaries. Specifically, a reasoning strategy is developed to learn the inter-relationship among these frames and generate soft labels on boundaries for more accurate frame-query reasoning. Such mechanism is also able to supplement the absent consecutive visual semantics to the sampled sparse frames for fine-grained activity understanding. Extensive experiments demonstrate the effectiveness of SSRN on three challenging datasets.
translated by 谷歌翻译
Human parsing aims to partition humans in image or video into multiple pixel-level semantic parts. In the last decade, it has gained significantly increased interest in the computer vision community and has been utilized in a broad range of practical applications, from security monitoring, to social media, to visual special effects, just to name a few. Although deep learning-based human parsing solutions have made remarkable achievements, many important concepts, existing challenges, and potential research directions are still confusing. In this survey, we comprehensively review three core sub-tasks: single human parsing, multiple human parsing, and video human parsing, by introducing their respective task settings, background concepts, relevant problems and applications, representative literature, and datasets. We also present quantitative performance comparisons of the reviewed methods on benchmark datasets. Additionally, to promote sustainable development of the community, we put forward a transformer-based human parsing framework, providing a high-performance baseline for follow-up research through universal, concise, and extensible solutions. Finally, we point out a set of under-investigated open issues in this field and suggest new directions for future study. We also provide a regularly updated project page, to continuously track recent developments in this fast-advancing field: https://github.com/soeaver/awesome-human-parsing.
translated by 谷歌翻译
A storyboard is a roadmap for video creation which consists of shot-by-shot images to visualize key plots in a text synopsis. Creating video storyboards however remains challenging which not only requires association between high-level texts and images, but also demands for long-term reasoning to make transitions smooth across shots. In this paper, we propose a new task called Text synopsis to Video Storyboard (TeViS) which aims to retrieve an ordered sequence of images to visualize the text synopsis. We construct a MovieNet-TeViS benchmark based on the public MovieNet dataset. It contains 10K text synopses each paired with keyframes that are manually selected from corresponding movies by considering both relevance and cinematic coherence. We also present an encoder-decoder baseline for the task. The model uses a pretrained vision-and-language model to improve high-level text-image matching. To improve coherence in long-term shots, we further propose to pre-train the decoder on large-scale movie frames without text. Experimental results demonstrate that our proposed model significantly outperforms other models to create text-relevant and coherent storyboards. Nevertheless, there is still a large gap compared to human performance suggesting room for promising future work.
translated by 谷歌翻译